-
Notifications
You must be signed in to change notification settings - Fork 14k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/Fix: times_series limit for Druid #3434
Conversation
Coverage increased (+0.3%) to 69.401% when pulling db70e73d3ed5436341e5cfaa4f056ac63dc554f4 on tc-dc:fmenges/druid_timeseries into 3dfdde1 on apache:master. |
1 similar comment
Coverage increased (+0.3%) to 69.401% when pulling db70e73d3ed5436341e5cfaa4f056ac63dc554f4 on tc-dc:fmenges/druid_timeseries into 3dfdde1 on apache:master. |
db70e73
to
ddcb94b
Compare
Coverage increased (+0.1%) to 69.25% when pulling ddcb94b47c3b4cf887abbef4a08cac75d1454c4b on tc-dc:fmenges/druid_timeseries into 147c12d on apache:master. |
ddcb94b
to
5d5090b
Compare
Coverage increased (+0.3%) to 69.402% when pulling 5d5090b76634eeeb137bf00d8e204962caa96f30 on tc-dc:fmenges/druid_timeseries into 8223729 on apache:master. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor change requested.
Sorry you had to dig through this piece of code, this is not the code I'm most proud of for sure...
Otherwise looks good, though this large messy method is hard to review and hard to unit test, so I'll have to trust you on it and test against our staging.
Which version of Druid are you running? I was told by some of the Druid core committers that the issue you were seeing (around topN not showing fragments) has been fixed in more recent version, and was hoping that an upcoming upgrade would fix this, but we do see this on our side at the moment.
I'm also hoping that we eventually move Druid to use SQLAlchemy as well, but that'll require a proper SQLAlchemy dialect and that's assuming that Druid offers as much control/performance through the SQL option...
superset/connectors/druid/models.py
Outdated
@@ -798,6 +798,28 @@ def values_for_column(self, | |||
def get_query_str(self, query_obj, phase=1, client=None): | |||
return self.run_query(client=client, phase=phase, **query_obj) | |||
|
|||
def _add_filter_from_pre_query_data(self, df, dimensions, filter): | |||
ret = filter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filter
is a reserved word which can be confusing, go for filter_
or filters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah... keep forgetting filter :(
…for each point in time
5d5090b
to
9fe8d73
Compare
We are running 0.10.1, we upgraded from 0.9.2 about 2 weeks ago and saw great performance gains across different query types. This changeset is to handle Druid behavior that is implemented like this by design (this is not addressing or working around a bug in Druid). If you specify a threshold of 5 for a TopN query, Druid will always return the top 5 results per granularity (lets ignore that its not guaranteed they are the actual top 5 but likely the top 5). The argument for this changeset is, that the behavior described is not actually useful and counter intuitive when you want a line-chart for the top 5 Campaigns. What you expect is to see the change over time for the same top 5 Campaigns day by day over the course of the week. We have been running this code for about 3 weeks and have not run into problems. |
…for each point in time (apache#3434) small fix
…for each point in time (apache#3434)
When setting s series limit for druid it currently returns the TopN for each individual datapoint instead of the the TopN for the entire time frame. See the before and after pictures attached.
This is probably more of a bug fix since the behavior after this change is consistent with the behavior that you see when you group by multiple columns.
Before:
After: